Something has changed recently that's broken our use of PagerDuty and I don't know how to fix it.
Previously, we'd generate alerts via e-mail and deduplicate them based on the ticket number our system generates in the subject line. Anything between # and / would belong to the same incident/alert.
(FYI - I'm using those two words, alert and incident, interchangeably as we've been with PagerDuty since before these were separate things. I'm still not totally sure I understand what the difference is between these two things as it applies to our use of PagerDuty, and this might be the source of our problems.)
So, as I was saying, any e-mail with the same ticket number in the subject line would be deduplicated. This matters because our system sends out another e-mail alert about the same ticket every 10 minutes until the issue is addressed in our system. Deduplication allowed PagerDuty to keep only one incident/alert in the system and generate only one page.
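To make the rule concrete, here is a minimal sketch of the key extraction as I understand it. The regular expression, the function name `dedup_key`, and the sample subject lines are all made up for illustration; this is not PagerDuty's actual implementation or our real subject format.

```python
import re

# Hypothetical pattern: capture whatever sits between the first "#"
# and the next "/" in the subject line.
TICKET_KEY = re.compile(r"#([^/]+)/")


def dedup_key(subject):
    """Return the ticket number used for deduplication, or None if absent."""
    match = TICKET_KEY.search(subject)
    return match.group(1).strip() if match else None


# Two e-mails about the same (made-up) ticket share one key,
# so they should collapse into a single incident/alert.
first = dedup_key("FAILED #12345/ backup job")
repeat = dedup_key("FAILED #12345/ backup job (reminder)")
print(first == repeat)
```

Any subject without a `#.../` segment yields `None` in this sketch, which is one place where an e-mail could silently fail to match.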
Once the issue was addressed in our system, an e-mail would kick off to PagerDuty closing out the incident/alert and it would be shown as resolved in the PD system.
If, in the future, the same ticket number entered a failed state and started generating alerts again, a new incident/alert would be created in PagerDuty, which would then send off the page. Again, additional messages from our system about the same issue would be deduplicated until it was addressed in our system and the "all clear" was sent to PagerDuty to close out the incident/alert.
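The lifecycle we relied on can be modeled roughly like this. It is only a sketch of the behavior described above, under the assumption that a resolved ticket number becomes free to trigger again; the class `IncidentTracker` and its methods are hypothetical and say nothing about how PagerDuty works internally.

```python
class IncidentTracker:
    """Toy model of the expected trigger / deduplicate / resolve cycle."""

    def __init__(self):
        self.open_keys = set()  # ticket numbers with an open incident
        self.pages_sent = 0

    def trigger(self, key):
        """Handle an alert e-mail; return True if it should page."""
        if key in self.open_keys:
            return False  # same ticket still open: deduplicated, no new page
        self.open_keys.add(key)
        self.pages_sent += 1
        return True

    def resolve(self, key):
        """Handle the "all clear" e-mail for a ticket."""
        self.open_keys.discard(key)


tracker = IncidentTracker()
tracker.trigger("12345")   # first failure: pages
tracker.trigger("12345")   # 10-minute reminder: deduplicated
tracker.resolve("12345")   # all clear
tracker.trigger("12345")   # same ticket fails again: should page again
print(tracker.pages_sent)
```

In this model the final `trigger` call produces a second page, which is the behavior we used to get.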
The problem is that this second part, a new incident/alert after a resolve, is no longer happening.
Once a ticket generates an alert/incident in PagerDuty and it gets resolved, if that same ticket goes back into a failed state at a future date and the e-mail with the alert gets sent off to PagerDuty, PagerDuty does nothing with that e-mail. It doesn't generate another incident/alert and we don't get paged.
This is obviously horrible as it means we're missing important pages.
I'm not sure when this problem started. As pages were still coming in, it seemed like everything was working, but at least once or twice something that should have generated a page didn't, and I just figured it might be a fluke. The same thing happened a third time this morning, so I started testing and figured out the pattern described above.
Anyone have any idea what we might need to change to work inside this new incident/alert system? I'm assuming that's somehow related to why this is no longer triggering a new alert/page... but I could be wrong. I'm just guessing and grabbing at straws here.
Thanks for any help or ideas you might have.
- Justin Swall